{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 8 - Mean Square Error and validation\n",
    "\n",
    "We will continue looking at the insurance data set from Lab 7."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import statsmodels.formula.api as smf\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn import datasets, linear_model\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the CSV file into a dataframe and display it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More about dummy variables\n",
    "\n",
    "Change the `sex`, `smoker`, and `region` columns to dummy variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First remember the scatter plot showing the relation being age, smoker and insurance charges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at a linear model with `age` and `smoker_yes` as the independent variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What's the linear model if $X_1$ is age and $X_2$ is smoker_yes?\n",
    "\n",
    "$$y = -2391.6264 + 274.8712x_1 + 23860x_2$$\n",
    "\n",
    "Consider all the non-smokers. For these people, what is $x_2$?  So what is the linear model for only non-smokers?\n",
    "\n",
    "Let's visualize this line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.relplot(x = \"age\", y = \"charges\", hue = \"smoker\", data = insurance)\n",
    "x = np.linspace(18,65,100)\n",
    "y = -2391.6264 + 274.8712*x\n",
    "plt.plot(x,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now consider only the smokers.  What is the value of $x_2$?  What is the linear model for the smokers?  \n",
    "\n",
    "Again, let's visualize the line for the smoker linear model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are these the same two linear models we would get if we filtered our data set into smokers and non-smokers and separately computed the linear model for each?\n",
    "\n",
    "Try it below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mean Squared Error\n",
    "\n",
    "The error is the difference between the actual $y_i$ value and what was predicted $\\hat{y_i}$.  The *mean squared error (MSE)* is the mean of the squares of the error terms:\n",
    "\n",
    "$$MSE =\\frac{\\sum_i (y_i - \\hat{y_i})^2}{n}$$\n",
    "\n",
    "While error in linear regression is often measured by the residual sum of squares (RSS), the mean square error is similar, but used to measure error for other methods of prediction.\n",
    "\n",
    "Let's compute the MSE for the above linear model.  How can you get the error terms?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now square the error terms:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now take the mean of the squared error terms:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Altogether:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "((insurance_new[\"charges\"] - lm.fittedvalues)**2).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare to the mean squared error with all columns as independent variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the p-values to determine which columns to remove from model.  Remove columns one at a time, as the p-values of the other columns will change."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the mean squared error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training and test sets\n",
    "\n",
    "Humans learn some things by training (ex. doing homework or practicing a sport or musical instrument), and can check if they know them by testing (ex. test in a university class, a game for a sport, or a performance for a musical instrument).  When you take a test in a math class, say, the questions are different from the ones you practiced on for homework, to ensure you understand the concepts and didn't simply memorize the answers.  We want to test the computer's ability to make a prediction in a similar way.\n",
    "\n",
    "Therefore, we want to divide our data into *training data* and *test data*.  We will \"learn\" the linear model on the training data, and then make prediction on the new, previously-unseen test data.\n",
    "\n",
    "We can use the sci-kit learn package to split up our data.  Notice we are specifying which columns to use as the input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(insurance_new[[\"age\",\"smoker_yes\"]], insurance[\"charges\"], test_size=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display X_train, X_test, y_train, and y_test below.  Are they what you expect?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also make a linear model with Sci-kit learn:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lm = linear_model.LinearRegression()\n",
    "model = lm.fit(X_train, y_train)\n",
    "predictions = lm.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the mean squared error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get R-squared:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens if you use a different test size?  Are the scores of the predictions more or less variable?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens if you use more columns to make the predictions?  Are the scores of the predictions more or less variable?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}